Petabyte-Scale Log Analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion | AWS Big Data Blog

Organizations frequently encounter the challenge of managing an immense volume of data that continues to expand rapidly. Concurrently, they must optimize operational expenses to derive value from this data for timely insights while maintaining consistent performance. As data proliferates across various data stores, warehouses, and lakes, efficient management becomes critical.

With a modern data architecture on AWS, you can swiftly establish scalable data lakes, utilize a diverse range of purpose-built data services, and ensure compliance through unified access, security, and governance. This enables your systems to scale at a low cost without sacrificing performance, facilitating seamless data sharing across organizational boundaries—ultimately allowing for swift and agile decision-making at scale.

Organizations can consolidate data from multiple silos into a centralized data lake and execute analytics and machine learning (ML) directly on that data. Additionally, other data can be stored in specialized data repositories for rapid insights from both structured and unstructured datasets. This data movement can occur in various directions: inside-out, outside-in, or around the perimeter.

For instance, logs and traces from web applications can be directly collected into a data lake, with a subset of that data being transferred to a log analytics store like Amazon OpenSearch Service for daily analysis. This is an example of inside-out data movement. Subsequently, the analyzed and aggregated data stored in Amazon OpenSearch Service can be transferred back to the data lake for running ML algorithms for further application use—this is referred to as outside-in data movement.

Use Case: Example Corp.

Let’s consider a practical use case. Example Corp., a prominent Fortune 500 company specializing in social content, generates logs and traces from hundreds of applications at a rate of approximately 500 TB per day. Their requirements include:

  • Keeping logs readily available for fast analytics for 2 days.
  • Beyond 2 days, having data accessible in a storage tier that is available for analytics with a reasonable service level agreement (SLA).
  • After one week, retaining data in cold storage for 30 days for compliance, auditing, and other purposes.

In the sections below, we will explore three potential solutions to address similar use cases:

  1. Tiered storage in Amazon OpenSearch Service and data lifecycle management
  2. On-demand ingestion of logs using Amazon OpenSearch Ingestion
  3. Direct queries from Amazon OpenSearch Service with Amazon Simple Storage Service (Amazon S3)

Solution 1: Tiered Storage in OpenSearch Service and Data Lifecycle Management

OpenSearch Service offers three integrated storage tiers: hot, UltraWarm, and cold storage. Depending on your data retention, query latency, and budgeting needs, you can select the most appropriate strategy to balance cost and performance. Data migration between different storage tiers is also feasible.

Hot storage is designed for indexing and updating, providing the quickest access to data. It takes the form of instance stores or Amazon Elastic Block Store (Amazon EBS) volumes attached to each node.

UltraWarm offers significantly lower costs per GiB for read-only data that is queried less frequently and does not require the same performance as hot storage. UltraWarm nodes utilize Amazon S3 along with caching solutions to enhance performance.

Cold storage is optimized for rarely accessed or historical data. When using cold storage, you detach your indexes from the UltraWarm tier, rendering them inaccessible until reattached, which can be done in seconds when you need to query that data.

For further details on data tiers within OpenSearch Service, refer to “Choose the right storage tier for your needs in Amazon OpenSearch Service.”
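
Before automating these transitions with ISM (shown in the next section), you can also move an index between tiers manually with the UltraWarm and cold storage migration APIs. The following Dev Tools-style requests are a minimal sketch; the index name log-2024.01.01 is a placeholder, and the exact endpoints are described in the UltraWarm and cold storage documentation for Amazon OpenSearch Service:

# Migrate a hot index to UltraWarm
POST _ultrawarm/migration/log-2024.01.01/_warm

# Check the migration status
GET _ultrawarm/migration/log-2024.01.01/_status

# Migrate the index from UltraWarm to cold storage
POST _ultrawarm/migration/log-2024.01.01/_cold

# Reattach a cold index to UltraWarm when you need to query it again
POST _cold/migration/_warm
{
    "indices": "log-2024.01.01"
}

In practice, the ISM policy in the next section schedules these transitions automatically, so no manual migration calls are needed.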

Solution Overview

The workflow for this solution consists of the following steps:

  1. Incoming data generated by applications is streamed to an S3 data lake.
  2. Data is ingested into Amazon OpenSearch Service in near real time using S3-SQS ingestion, with event notifications configured on the S3 buckets (a sample notification configuration follows this list).
  3. After 2 days, hot data is migrated to UltraWarm storage to support read queries.
  4. After 5 days (measured as index age), the data is migrated to cold storage and detached from compute; it can be reattached to UltraWarm when needed. Data is deleted from cold storage when the index is 21 days old.
  5. Daily indexes are maintained for easy rollover. An Index State Management (ISM) policy automates the rollover of indexes after 2 days, their migration through the storage tiers, and their eventual deletion.
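
For step 2, each S3 bucket publishes object-created events to an Amazon SQS queue that the ingestion layer polls. The following is a minimal sketch of such a notification configuration, which could be applied with the aws s3api put-bucket-notification-configuration command; the queue ARN, account ID, and key prefix are placeholders:

{
    "QueueConfigurations": [
        {
            "Id": "log-objects-created",
            "QueueArn": "arn:aws:sqs:us-east-1:123456789012:log-ingestion-queue",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {
                "Key": {
                    "FilterRules": [
                        {
                            "Name": "prefix",
                            "Value": "application-logs/"
                        }
                    ]
                }
            }
        }
    ]
}

Each notification tells the ingestion pipeline which object to fetch from Amazon S3 and index, so new log files become searchable shortly after they land in the data lake.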

The following is a sample ISM policy that rolls over data into the UltraWarm tier after 2 days, moves it to cold storage after 5 days, and deletes it from cold storage after 21 days:

{
    "policy": {
        "description": "hot warm delete workflow",
        "default_state": "hot",
        "schema_version": 1,
        "states": [
            {
                "name": "hot",
                "actions": [
                    {
                        "rollover": {
                            "min_index_age": "2d",
                            "min_primary_shard_size": "30gb"
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "warm"
                    }
                ]
            },
            {
                "name": "warm",
                "actions": [
                    {
                        "replica_count": {
                            "number_of_replicas": 5
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "cold",
                        "conditions": {
                            "min_index_age": "5d"
                        }
                    }
                ]
            },
            {
                "name": "cold",
                "actions": [
                    {
                        "retry": {
                            "count": 5,
                            "backoff": "exponential",
                            "delay": "1h"
                        },
                        "cold_migration": {
                            "start_time": null,
                            "end_time": null,
                            "timestamp_field": "@timestamp",
                            "ignore": "none"
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "21d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "cold_delete": {}
                    }
                ],
                "transitions": []
            }
        ],
        "ism_template": {
            "index_patterns": [
                "log*"
            ],
            "priority": 100
        }
    }
}
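
Because the policy relies on a rollover action and an ism_template for indexes matching log*, the indexes also need a rollover alias and an initial write index. The following is a minimal sketch; the policy ID hot-warm-cold-delete, the template name log-template, and the alias logs-write are illustrative names rather than required values:

# Store the policy under an ID (the request body is the policy document above)
PUT _plugins/_index_state_management/policies/hot-warm-cold-delete

# Point new log* indexes at the rollover alias that the rollover action writes to
PUT _index_template/log-template
{
    "index_patterns": ["log*"],
    "template": {
        "settings": {
            "plugins.index_state_management.rollover_alias": "logs-write"
        }
    }
}

# Bootstrap the first index behind the write alias
PUT log-000001
{
    "aliases": {
        "logs-write": {
            "is_write_index": true
        }
    }
}

Because the ism_template block targets the log* pattern, the policy is attached automatically to each new index that the rollover action creates, so this bootstrapping only needs to happen once.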

Considerations

UltraWarm employs sophisticated caching techniques to facilitate querying of infrequently accessed data. Although the data is accessed infrequently, the compute for UltraWarm nodes must keep running, so you pay for those nodes for as long as data remains in the warm tier. Cold storage detaches the data from compute entirely, which is what makes it the lowest-cost tier, but an index must be reattached to UltraWarm before it can be queried again.
